Second-order co-occurrence pointwise mutual information

Second-order co-occurrence pointwise mutual information (SOC-PMI) is a semantic similarity measure that uses pointwise mutual information to sort lists of important neighbor words of the two target words from a large corpus. The earlier PMI-IR method used AltaVista's Advanced Search query syntax to calculate probabilities, and AltaVista's "NEAR" search operator was essential to it; since that operator is no longer available in AltaVista, the PMI-IR method cannot, from the implementation point of view, be used in the same form in new systems. From the algorithmic point of view, the advantage of SOC-PMI is that it can calculate the similarity between two words that do not co-occur frequently, because they co-occur with the same neighboring words. For example, the British National Corpus (BNC) has been used as a source of frequencies and contexts. The method considers the words that are common in both lists and aggregates their PMI values (from the opposite list) to calculate the relative semantic similarity. We define the ''pointwise mutual information'' function for only those words having f^b(t_i, w)>0,
:
f^\text{pmi}(t_i,w)=\log_2 \frac{f^b(t_i,w)\times m}{f^t(t_i)f^t(w)},

where f^t(t_i) tells us how many times the type t_i appeared in the entire corpus, f^b(t_i, w) tells us how many times word t_i appeared with word w in a context window, and m is the total number of tokens in the corpus. Now, for word w, we define a set of words, X^w, sorted in descending order by their PMI values with w, and take the top-most \beta words having f^\text{pmi}(t_i, w)>0.
The set X^w contains words X_i^w,
:X^w=\{X_i^w\}, where i=1, 2, \ldots, \beta and
:f^\text{pmi}(X_1^w, w)\geq f^\text{pmi}(X_2^w, w)\geq \cdots \geq f^\text{pmi}(X_{\beta-1}^w, w)\geq f^\text{pmi}(X_\beta^w, w)
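As a concrete illustration of these two steps, here is a minimal Python sketch that builds the PMI table and the sorted neighbour set X^w, assuming co-occurrence counts f^b and type frequencies f^t have already been collected from the corpus; the function names and data layout are illustrative assumptions, not part of the method's description.

 from math import log2
 
 def pmi_table(cooc, type_freq, m):
     """Compute f^pmi(t, w) = log2(f^b(t, w) * m / (f^t(t) * f^t(w)))
     for every pair with a positive co-occurrence count f^b(t, w) > 0.
 
     cooc      -- dict mapping (t, w) pairs to co-occurrence counts f^b
     type_freq -- dict mapping each type t to its corpus frequency f^t
     m         -- total number of tokens in the corpus
     """
     return {
         (t, w): log2(f_b * m / (type_freq[t] * type_freq[w]))
         for (t, w), f_b in cooc.items()
         if f_b > 0
     }
 
 def top_neighbours(w, pmi, beta):
     """Build X^w: the top-most beta words with positive PMI with w,
     sorted in descending order of f^pmi."""
     scored = [(t, v) for (t, u), v in pmi.items() if u == w and v > 0]
     scored.sort(key=lambda pair: pair[1], reverse=True)
     return [t for t, _ in scored[:beta]]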
A rule of thumb is used to choose the value of \beta. The ''\beta-PMI summation'' function of a word is defined with respect to another word. For word w_1 with respect to word w_2 it is:
:
f(w_1,w_2,\beta)=\sum_{i=1}^{\beta} \left(f^\text{pmi}(X_i^{w_1},w_2)\right)^\gamma

where f^\text{pmi}(X_i^{w_1},w_2)>0, which sums all the positive PMI values of words in the set X^{w_2} that are also common to the words in the set X^{w_1}. In other words, this function aggregates the positive PMI values of all the semantically close words of w_2 that are also common in w_1's list. \gamma should have a value greater than 1. So, the ''\beta-PMI summation'' function for word w_1 with respect to word w_2, with \beta=\beta_1, and the ''\beta-PMI summation'' function for word w_2 with respect to word w_1, with \beta=\beta_2, are
:
f(w_1,w_2,\beta_1)=\sum_{i=1}^{\beta_1}\left(f^\text{pmi}(X_i^{w_1},w_2)\right)^\gamma

and
:
f(w_2,w_1,\beta_2)=\sum_{i=1}^{\beta_2}\left(f^\text{pmi}(X_i^{w_2},w_1)\right)^\gamma
respectively.
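Continuing the sketch above, a short Python rendering of the \beta-PMI summation follows; the default \gamma=3.0 is only a placeholder satisfying the \gamma>1 requirement, not a value taken from this text.

 def beta_pmi_sum(w1, w2, beta, pmi, gamma=3.0):
     """f(w1, w2, beta): sum of the positive PMI values
     f^pmi(X_i^{w1}, w2), each raised to the power gamma, over the
     top-beta neighbours X_i^{w1} of w1 that also have positive PMI
     with w2.  gamma=3.0 is a placeholder satisfying gamma > 1."""
     return sum(
         pmi[(t, w2)] ** gamma
         for t in top_neighbours(w1, pmi, beta)
         if pmi.get((t, w2), 0.0) > 0.0
     )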
Finally, the ''semantic PMI similarity'' function between the two words, w_1 and w_2, is defined as
:
\mathrm{Sim}(w_1,w_2)=\frac{f(w_1,w_2,\beta_1)}{\beta_1}+\frac{f(w_2,w_1,\beta_2)}{\beta_2}.

The semantic word similarity is normalized so that it provides a similarity score between 0 and 1 inclusive. The normalization algorithm takes as arguments the two words, r_i and s_j, and a maximum value, \lambda, that can be returned by the semantic similarity function Sim(), and returns a normalized similarity score between 0 and 1 inclusive. For example, the algorithm returns 0.986 for the words ''cemetery'' and ''graveyard'' with \lambda=20 (for the SOC-PMI method).
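Putting the pieces together, a hedged sketch of Sim(w_1, w_2) and its normalization follows; the clamp-at-\lambda-and-divide step is an assumption standing in for the normalization algorithm, which the text describes only by its arguments and its [0, 1] range.

 def soc_pmi_similarity(w1, w2, beta1, beta2, pmi, gamma=3.0, lam=20.0):
     """Sim(w1, w2) = f(w1, w2, beta1)/beta1 + f(w2, w1, beta2)/beta2,
     then mapped into [0, 1] using the maximum value lam (lambda).
     The clamp-and-divide step below is an illustrative assumption;
     the text specifies only the arguments and the [0, 1] range."""
     sim = (beta_pmi_sum(w1, w2, beta1, pmi, gamma) / beta1
            + beta_pmi_sum(w2, w1, beta2, pmi, gamma) / beta2)
     return min(sim, lam) / lam
 
 # e.g. soc_pmi_similarity("cemetery", "graveyard", beta1, beta2, pmi)
 # would be expected to score near 1 for such close synonyms.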
==References==

* Islam, A. and Inkpen, D. (2008). Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data 2(2), 1–25.
* Islam, A. and Inkpen, D. (2006). Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, pp. 1033–1038.
